This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.
Not all users receive the same offer, and that is the challenge to solve with this data set.
Your task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.
Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.
You'll be given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer.
Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.
To give an example, a user could receive a "spend 10 dollars, get 2 dollars off" discount offer on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.
However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view the offer, and still complete the offer. For example, a user might receive the "buy 10 dollars get 2 dollars off offer", but the user never opens the offer during the 10 day validity period. The customer spends 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed the offer.
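To make the attribution rule above concrete, here is a minimal sketch (toy records, not the actual dataset): a completion only counts as influenced when a view precedes or coincides with it.

```python
# Hedged sketch: given toy (time_in_hours, event) records for one hypothetical
# offer instance, decide whether a completion was actually influenced,
# i.e. whether the offer was viewed at or before the moment it was completed.
def influenced_completion(events):
    """Return True only if 'offer viewed' occurs no later than 'offer completed'."""
    viewed_at = min((t for t, e in events if e == 'offer viewed'), default=None)
    completed_at = min((t for t, e in events if e == 'offer completed'), default=None)
    if completed_at is None:
        return False                      # offer never completed at all
    return viewed_at is not None and viewed_at <= completed_at

# Completed without ever viewing: a completion record exists, but no influence.
uninfluenced = [(0, 'offer received'), (120, 'offer completed')]
# Viewed first, then completed: a genuine response to the offer.
influenced = [(0, 'offer received'), (24, 'offer viewed'), (120, 'offer completed')]
```

This is only the attribution idea in isolation; the real transcript needs the events grouped per customer and per offer instance first.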
This makes data cleaning especially important and tricky.
You'll also want to take into account that some demographic groups will make purchases even if they don't receive an offer. From a business perspective, if a customer is going to make a 10 dollar purchase without an offer anyway, you wouldn't want to send a buy 10 dollars get 2 dollars off offer. You'll want to try to assess what a certain demographic group will buy when not receiving any offers.
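As a rough illustration of that baseline idea (with made-up records; the real transcript schema is richer), one could average the spend of customers who never received an offer at all:

```python
# Sketch: estimate "no-offer" baseline spend from a toy event log.
import pandas as pd

events = pd.DataFrame({
    'person': ['a', 'a', 'b', 'c', 'c'],
    'event':  ['offer received', 'transaction', 'transaction',
               'transaction', 'transaction'],
    'amount': [None, 12.0, 8.0, 5.0, 7.0],
})
# Customers who appear in at least one 'offer received' record.
got_offer = set(events.query('event == "offer received"')['person'])
# Transactions made by customers with no offer exposure at all.
txns = events.query('event == "transaction"')
no_offer_txns = txns[~txns['person'].isin(got_offer)]
baseline_spend = no_offer_txns['amount'].mean()
```

Segmenting this baseline by demographic group would answer the question posed above.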
Because this is a capstone project, you are free to analyze the data any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether or not someone will respond to an offer. Or, you don't need to build a machine learning model at all. You could develop a set of heuristics that determine which offer you should send to each customer (e.g., 75 percent of 35-year-old female customers responded to offer A vs 40 percent of the same demographic to offer B, so send offer A).
The data is contained in three files:
Here is the schema and explanation of each variable in the files:
portfolio.json
profile.json
transcript.json
Running this entire notebook may take some time (in excess of 30 minutes)
This project analyzes Starbucks customers who use its mobile app. The aim is to analyze individual customer behavior to identify notable patterns. The findings of such an analysis should help the Starbucks business re-evaluate its rewards program.
Ideally, the analysis should profile customers based on their spending behavior and connect it with their demographic attributes.
In this context, I frame the problem as understanding customer behavior through the available data points and then providing specific business inputs based on the results.
We should aim to provide:
Is there any correlation between customers' spending habits and the Starbucks rewards program?
Can we identify high-paying customers who don't care about the rewards program?
Can we identify customers who haven't made any transactions in the past? Such customers might be pursued individually through a rewards program.
If possible, we should combine all the datasets, clean them, and preprocess them so that they suit the target machine learning model.
Roughly speaking, machine learning problems can be divided into supervised, semi-supervised, unsupervised, and reinforcement learning.
Since we are going to build a supervised learning model to profile customers, let's recap some fundamentals of supervised machine learning.
Supervised Learning
In supervised learning, the dataset is a collection of labeled examples. Each example is described by a feature vector, and each attribute of an example is one dimension of that feature vector.
For instance, in the profile dataset, we have customer id, age, gender and income attributes. Each of these attributes for a specific customer is a dimension of that customer's feature vector. Based on the input features, the task is to deduce a label automatically.
For instance, the model created using the dataset of profile could take as input a feature vector describing a customer and output a probability that the customer responds to a given offer.
Classification is a problem of automatically assigning a label to an unlabeled example.
In a classification problem, a label is a member of a finite set of classes. If the size of the set of classes is two (responds or doesn't respond to an offer), it is called a binary classification problem. Multiclass classification is a classification problem with three or more classes.
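As a toy illustration of binary classification (hypothetical numbers, not project data), a scikit-learn classifier could map [age, income in thousands] to a respond/not-respond label:

```python
# Illustrative sketch only: features and labels below are made up.
from sklearn.linear_model import LogisticRegression

# Each row is a feature vector [age, income_in_thousands];
# each label is 1 (responded to offer) or 0 (did not respond).
X = [[25, 40], [30, 45], [35, 80], [50, 90], [45, 85], [22, 35]]
y = [0, 0, 1, 1, 1, 0]

clf = LogisticRegression().fit(X, y)
# predict_proba returns class probabilities for a new customer;
# index 1 is the probability of the "responds" class.
proba_respond = clf.predict_proba([[40, 82]])[0][1]
```

The same shape of problem, with far richer features, is what the merged Starbucks data supports.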
'''
We will be using the bokeh library, in addition to matplotlib and seaborn,
for visualisation of data throughout this notebook.
Below is a simple line chart using bokeh for demonstration purposes.
You can use the toolbox on the right side of the chart to try out different functionalities
like zoom-in, zoom-out, and reset to analyse a specific part of the chart/graph.
Bokeh documentation can be found here: https://docs.bokeh.org/en/latest/index.html
'''
from bokeh.application import Application
from bokeh.application.handlers import FunctionHandler
from bokeh.plotting import figure, output_file, show
from bokeh.io import output_notebook
from bokeh.layouts import row, gridplot
import scipy.special
output_notebook()
#default configuration to save bokeh chart upon creation
output_file('current_file.html', title='Bokeh Plot', mode='inline')
#example bokeh line chart
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]
plot = figure(title="simple line", x_axis_label='x', y_axis_label='y')
plot.line(x, y, legend_label='Temp.', line_width=2)
show(plot)
import pandas as pd
import numpy as np
import json
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
from jupyterthemes import jtplot
import warnings
import os
import catboost
import joblib
warnings.filterwarnings('ignore')
#from subprocess import check_output
#print(check_output(["ls", "../input"]).decode("utf8"))
plt.rcParams['figure.figsize'] = (20.0, 10.0)
%matplotlib inline
from matplotlib.figure import Figure
from catboost import CatBoostClassifier, Pool, cv
from catboost import MetricVisualizer
from sklearn.preprocessing import LabelEncoder,robust_scale, scale
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ParameterGrid
from sklearn.model_selection import train_test_split
from itertools import product, chain
from tqdm import tqdm
# choose which theme to inherit plotting style from
# onedork | grade3 | oceans16 | chesterish | monokai | solarizedl | solarizedd
jtplot.style(theme='oceans16')
# set "context" (paper, notebook, talk, poster)
# scale font-size of ticklabels, legend, etc.
# remove spines from x and y axes and make grid dashed
jtplot.style(context='talk', fscale=1.4, spines=False, gridlines='--')
# turn on X- and Y-axis tick marks (default=False)
# turn off the axis grid lines (default=True)
# and set the default figure size
jtplot.style(ticks=True, grid=False, figsize=(6, 4.5))
# reset default matplotlib rcParams
#jtplot.reset()
path = '../data/raw/'
# read in the json files
profile = pd.read_json(os.path.join(path, 'profile.json'), orient='records', lines=True)
portfolio = pd.read_json(os.path.join(path, 'portfolio.json'), orient='records', lines=True)
transcript = pd.read_json(os.path.join(path, 'transcript.json'), orient='records', lines=True)
portfolio.shape
portfolio.head()
portfolio.describe()
portfolio['channels'].values.tolist()
Observation
Looking at the 'channels' column, it makes sense to disaggregate each channel into its own column encoded with 0/1 identifiers.
Let's write a function for it.
def generic_one_hot_encoding(col, target_df):
    '''
    Transforms the given 'col', whose cells hold lists of attributes, into
    separate columns with 0/1 encoding values and drops the original column.
    Inputs:
        - col : input column name
          example : channels -> [web, email, mobile, social]
        - target_df : dataframe containing 'col'
    Return Value:
        - 'target_df' with encoded columns named like
          [is_web, is_email, is_mobile, is_social]
    '''
    all_encoded_candidates = set()
    for enc_list in target_df[col].values.tolist():
        for cnd in enc_list:
            all_encoded_candidates.add(cnd)
    for cnd in all_encoded_candidates:
        target_df['is_' + cnd] = target_df[col].apply(lambda x: int(cnd in x))
    target_df = target_df.drop(col, axis=1)
    return target_df
portfolio = generic_one_hot_encoding('channels', portfolio)
portfolio.head(2)
Observation
Let's create dummies for the offer_type column and drop the original column, as it is no longer required.
# We will create dummies for the column offer_type and remove the original column
portfolio = portfolio.join(pd.get_dummies(portfolio.offer_type))
portfolio = portfolio.drop('offer_type', axis=1)
portfolio.head(5)
Let's re-order the columns to a more natural layout (i.e. the id column should come first).
portfolio = portfolio[
['id', 'difficulty', 'reward',
'duration','is_mobile', 'is_web',
'is_social', 'is_email','bogo',
'discount','informational']
]
portfolio
Observation
Before we move on to the Profile dataset, let's briefly consider a potential data scaling issue in the Portfolio dataframe. 'difficulty', 'reward' and 'duration' are on different numeric scales and units.
We should rescale them to be centered around 0 if the chosen machine learning algorithm expects that.
figure, axis = plt.subplots(figsize=(20, 5), nrows=1, ncols=2)
numeric_columns = ['difficulty','reward','duration']
numeric_df = portfolio[numeric_columns]
numeric_df.plot.density(ax=axis[0])
axis[0].set_title('Unscaled data')
axis[0].set_xlabel('Unscaled value')
scaled_df = robust_scale(numeric_df)
scaled_df = pd.DataFrame(scaled_df, columns=numeric_df.columns, index=numeric_df.index)
scaled_df.plot.density(ax=axis[1])
axis[1].set_title('Scaled data')
axis[1].set_xlabel('Scaled value')
profile
profile.describe()
set(profile.gender.values.tolist())
profile[profile['age'] == 118]
Observation
It is obvious that the 'age' value 118 is a placeholder for missing values. The above analysis suggests that 2175 entries are missing gender, age and income together.
We should consider removing all such rows whenever we analyze demographic data in order to find insights. For now, let's keep them as-is.
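If we later decide to exclude these rows, a simple boolean flag makes that a one-liner. A minimal sketch with made-up rows (the real profile has 17,000 entries):

```python
# Sketch: treat age == 118 as the missing-demographics placeholder and flag
# those rows so demographic analyses can exclude them in a single step.
import pandas as pd
import numpy as np

demo = pd.DataFrame({
    'age': [55, 118, 33, 118],
    'gender': ['F', None, 'M', None],
    'income': [72000.0, np.nan, 51000.0, np.nan],
})
demo['missing_demographics'] = demo['age'].eq(118)
# Demographic analyses can then operate on the clean subset only.
clean_demo = demo.loc[~demo['missing_demographics']]
```

In this dataset, age, gender and income go missing together, so the single age flag is enough to cover all three.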
profile['gender'].value_counts()
profile.isnull().sum()
profile.became_member_on
Observation
profile.became_member_on = pd.to_datetime(profile.became_member_on, format='%Y%m%d', errors='coerce')
profile.head()
profile.became_member_on.min()
profile['member_since_x_days'] = (
pd.to_datetime('today') - profile.became_member_on)
profile['member_since_x_days'] = profile['member_since_x_days'].dt.days
profile.head(2)
Let's create some visualisations to analyze the demographic data
age_df = pd.DataFrame(profile.age[~(profile.age == 118)])
description_pdstyle = \
pd.DataFrame(age_df.describe()).style \
.set_caption('Ages description') \
.set_table_attributes('style="display:inline;' \
'vertical-align:top"')
fig1 = Figure(figsize=(10,4))
axs = fig1.subplots(nrows=1, ncols=2)
axs = axs.flatten()
axs[0].set_title('Box Plot')
age_df.plot.box(ax=axs[0])
axs[1].violinplot(age_df.values)
axs[1].set_title('Violin Plot')
bins = 84
fig2, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 10))
axs = axs.flatten()
age_df.hist(bins=bins, ax=axs[0], cumulative=True, density=True)
axs[0].set_title('Cumulative distribution')
axs[0].set_ylabel('Customers (%)')
axs[0].set_yticklabels((axs[0].get_yticks()*100).round(0))
axs[0].set_xscale('linear')
axs[0].set_xlabel('years')
axs[0].grid(False)
age_df.hist(bins=bins, ax=axs[1])
axs[1].set_title('Density')
axs[1].set_ylabel('Customers')
axs[1].set_xscale('linear')
axs[1].set_xlabel('years')
axs[1].grid(False)
age_df.plot.density(ax=axs[2])
axs[2].legend()
axs[2].set_title('Density plot')
axs[2].set_ylabel('density')
axs[2].grid(False)
age_gender_df = profile[~(profile.age == 118)].groupby('gender').age
age_gender_df.plot.density(ax=axs[3])
axs[3].set_title('Density by Gender')
axs[3].set_ylabel('density')
axs[3].legend()
axs[3].grid(False)
plt.show()
Observation
from bokeh.plotting import figure, output_file, show
def make_distribution_plot(title, hist, edges, x):
'''
This function
plots a density distribution of customer age
'''
p = figure(title=title, tools='', background_fill_color="#fafafa")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
fill_color="navy", line_color="white", alpha=0.5)
p.y_range.start = 0
p.y_range.end = 0.03
p.legend.location = "center_right"
p.legend.background_fill_color = "#fefefe"
p.xaxis.axis_label = 'x'
p.yaxis.axis_label = 'Pr(x)'
p.grid.grid_line_color="white"
return p
measured = np.asarray(age_df.age.tolist(), dtype=np.float32)
hist, edges = np.histogram(measured, density=True, bins=84)
x = measured
output_file('age_distribution.html')
p1 = make_distribution_plot("Age Distribution", hist, edges, x)
show(p1)
Let's create a histogram showing income range against number of customers
axs = profile.income.hist(bins = 20)
axs.grid(False)
axs.set_ylabel('# of customers')
axs.set_xlabel('Income in $')
Observation
Next, let's create a chart showing customer sign up year
axs = profile.became_member_on.groupby(profile.became_member_on.dt.year).hist(bins=30)
Observation
Let's draw some pair-wise plot of demographics data. We should drop the NA values
def draw_pair_plot(df):
sns.set(style= "ticks", color_codes=True)
normal_pair_plot = sns.pairplot(df[['age',
'income',
'gender',
'member_since_x_days']].dropna(),
hue='gender',
palette="husl",
markers=["o", "s", "D"])
normal_pair_plot
draw_pair_plot(profile)
Observation: It seems that there are not many female customers and the female group is relatively older and has higher income
Next, Let's visualise a pair plot using linear regression fit
#Fit linear regression models to the scatter plots:
sns.set(style= "ticks", color_codes=True)
reg_pair_plot = sns.pairplot(profile[['age',
'income',
'gender',
'member_since_x_days']].dropna(),
hue='gender',
palette="PuOr",
markers=["o", "s", "D"], kind= "reg")
reg_pair_plot
Observation
The above set of charts leads to some interesting observations.
transcript
transcript.describe()
Observation
#Let's transform value column into separate columns
def transform_value_column(transaction):
'''transform/unpivot the value column and return new transcript dataframe'''
values = pd.DataFrame(transaction.value.tolist())
values.offer_id.update(values['offer id'])
values = values.drop('offer id', axis=1)
return transaction.join(values).drop('value', axis=1)
transcript = transform_value_column(transcript)
transcript.head(2)
transcript.event.value_counts()
Let's create a pie chart to visualise how many offers were completed and viewed relative to offers received
#https://docs.bokeh.org/en/latest/docs/gallery/pie_chart.html?highlight=pie%20chart (derived from here)
def draw_pie_chart(data_dict, output_file_name):
'''
This function
- draws a pie chart out of given data_dict
- output a html file with given 'output_file_name'
'''
from math import pi
import pandas as pd
from bokeh.io import output_file, show
from bokeh.palettes import Category20c
from bokeh.plotting import figure
from bokeh.transform import cumsum
output_file(output_file_name + ".html")
x = data_dict
data = pd.Series(x).reset_index(name='value').rename(columns={'index':'event'})
data['angle'] = data['value']/data['value'].sum() * 2*pi
data['color'] = Category20c[len(x)]
p = figure(plot_height=350, title="Pie Chart", toolbar_location=None,
tools="hover", tooltips="@event: @value", x_range=(-0.5, 1.0))
p.wedge(x=0, y=1, radius=0.4,
start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),
line_color="white", fill_color='color', legend_field='event', source=data)
p.axis.axis_label=None
p.axis.visible=False
p.grid.grid_line_color = None
show(p)
data_dict_1 = {
'Transaction': 138953,
'Offer received': 76277,
'Offer viewed': 57725,
'Offer completed': 33579
}
draw_pie_chart(data_dict_1, 'offer_distribution_1')
data_dict_2 = {
'Offer received': 76277,
'Offer viewed': 57725,
'Offer completed': 33579
}
draw_pie_chart(data_dict_2, 'offer_distribution_2')
Let's do a quick analysis to check whether the transcript contains any customers not found in the profile data set
t_p_set =set(transcript['person'].values.tolist())
len(t_p_set)
p_set = set(profile['id'].values.tolist())
len(p_set)
print(t_p_set - p_set)
Observation
The above analysis suggests that every person in the transcript is found in the profile,
i.e. for each transaction, there is a person with known demographics
transcript.isna().sum()
Observation
First of all, we should find, for each customer, the total count of offers completed after being viewed. The current data includes offers 'completed' by customers either knowingly or unknowingly. Customers who unknowingly complete offers should not be sent further offers; this group represents customers who are profitable for the company regardless.
In order to answer the above question,
let's find customers who have received offers but have not viewed them. We first combine the transcript dataset with the customer demographics (profiles) and then identify those customers.
It could also be the case that a single customer received multiple offers, did not view all of them, and completed some of them unknowingly.
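The core of that identification can be sketched on a toy transcript (column names mirror the real data; the values are made up): completions with no matching view record are the "unknowing" ones.

```python
# Sketch: find persons who completed an offer without ever viewing it.
import pandas as pd

toy = pd.DataFrame({
    'person':   ['a', 'a', 'b', 'b', 'b'],
    'event':    ['offer received', 'offer completed',
                 'offer received', 'offer viewed', 'offer completed'],
    'offer_id': ['o1', 'o1', 'o1', 'o1', 'o1'],
})
completed = toy.query('event == "offer completed"')[['person', 'offer_id']]
viewed = toy.query('event == "offer viewed"')[['person', 'offer_id']]
# Left merge with indicator: rows present only in `completed` were never viewed.
unviewed = completed.merge(viewed, how='left', indicator=True) \
                    .query('_merge == "left_only"')
unknowing_completers = set(unviewed['person'])
```

The real data additionally needs the view to fall inside the offer's validity window, which this sketch ignores.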
def join_data(transcript, profile, portfolio):
    '''
    Joins the given dataframes into a single dataframe based on common columns.
    '''
    df = transcript.merge(profile, left_on='person', right_on='id',
                          how='left').drop('id', axis=1)
    df = df.merge(portfolio, left_on='offer_id', right_on='id',
                  how='left').drop('id', axis=1)
    return df
merged_df = join_data(transcript, profile, portfolio)
merged_df.head(5)
#Let's look at the column names
merged_df.columns
#Let's rename the reward columns to signify reward_x -> rewards_t (transcript) and reward_y -> rewards_p (portfolio)
merged_df = merged_df.rename(columns={'reward_x' : 'rewards_t' , 'reward_y' : 'rewards_p'})
Now we have joined dataset which we should explore further.
As a first step, let's query the dataset to display the first few events of each type, plus the events for two particular customers
display(pd.DataFrame().append([
merged_df.query('event=="offer received"').head(),
merged_df.query('event=="offer viewed"').head(),
merged_df.query('event=="transaction"').head(),
merged_df.query('event=="offer completed"').head(),
merged_df.query('person=="78afa995795e4d85b5d9ceeca43f5fef"').head(),
merged_df.query('person=="02c083884c7d45b39cc68e1314fec56c"').head()]))
Observation
As soon as an offer completes, an entry appears in 'rewards_t' equal to 'rewards_p'.
The timestamp starts at 0, when the first offer received event occurred.
When a customer makes a qualifying transaction within the offer duration (the duration column, expressed in days), the offer completes immediately (see records 47582, 47583), and the reward is credited to the customer.
A customer may receive two different offers at the same time and may complete both at the same time, given that a qualifying transaction is made.
Let's examine events separately
offer_received_events = pd.DataFrame().append([
merged_df[['person', 'event', 'time', 'offer_id', 'amount', 'duration','rewards_t', 'rewards_p']]
.query('event=="offer received"')])
offer_received_events.head(5)
offer_received_events.time.max()/24
offer_received_events.time.describe()
offer_received_events.time.value_counts().sort_values(ascending=True)
s_sent = offer_received_events.time.value_counts().sort_index(ascending=True)
#Let's find out on which days offers are sent by dividing timeline by 24.
offer_sent_time_in_hours = offer_received_events.time.value_counts().sort_index(ascending=True).index.tolist()
offer_sent_time_in_days = []
for x in offer_sent_time_in_hours:
    if x != 0:
        x = x / 24
        offer_sent_time_in_days.append(x)
offer_sent_time_in_days.append(0.0)
print(offer_sent_time_in_hours)
offer_sent_time_in_days.sort()
print(offer_sent_time_in_days)
offer_received_events.time.value_counts().sort_values(ascending=True).describe()
Observation
We can see that the timeline range is [0, 576] for the offer received event, so all offers are received within a 24-day period (or the data is sampled for only 24 days).
There are only 6 distinct time values present, and if we divide those values by 24, we get the day on which each batch of offers was sent.
Looking at the statistics for those 6 days, around 12712 offers are received on each of them.
Assuming offers are received in equal proportions at each instance, the dataset is balanced with respect to the number of offers sent out at each time interval.
Let's explore offers viewed events
offer_viewed_events = pd.DataFrame().append([
merged_df[['person', 'event', 'time', 'offer_id', 'amount', 'duration','rewards_t', 'rewards_p']]
.query('event=="offer viewed"')])
offer_viewed_events.head(5)
offer_viewed_events.time.describe()
max_duration_in_days = offer_viewed_events.duration.value_counts().index.max()
max_duration_in_hours = max_duration_in_days * 24
max_allowed_view_duration = max_duration_in_hours + 576 # 576 is the last timestamp on which offers were sent out
max_allowed_view_duration
if offer_viewed_events.time.max() <= max_allowed_view_duration:
    print('All offer viewed events have a timestamp less than or equal to the max allowed view duration')
else:
    print('Timestamps for offer viewed events are not marked properly')
offer_viewed_events.time.value_counts()
The above series suggests that notably more offers are viewed at particular times. Let's plot the number of offers viewed against the timeline.
offer_viewed_time = offer_viewed_events.time.value_counts().sort_index()
offer_sent = offer_received_events.time.value_counts().sort_index()
#we shrink the count of offers sent to a quarter so the chart is easier to read; see the legend for recovering the actual values
offer_sent_shrinked = offer_sent/4
p1 = None
output_file('offers_viewed_timeline.html')
p1 = figure(title="# of offer viewed over time (in hours)",
x_axis_label='timestamp in hours',
y_axis_label = '# of events',
plot_width=900, tools= "pan,wheel_zoom,box_zoom,reset,hover"
)
yticks = np.array([0, 50, 100, 200, 500, 1000, 2000, 3000, 3500])
p1.yaxis.ticker = yticks
p1.line(offer_viewed_time.index, offer_viewed_time, line_width =2)
p1.circle(offer_viewed_time.index, offer_viewed_time, fill_color = "white",color = "navy", size=8 , legend_label='offer_viewed')
p1.cross(offer_sent_shrinked.index, offer_sent_shrinked, size=12,
color="#E6550D", line_width=2, legend_label='* 4 = offers_sent')
show(p1)
Observation
The above chart displays the number of offers viewed against the timestamp in hours. There are six peaks where the number of viewed offers increases dramatically.
I have embedded the number of offers received in the chart (the actual received-offer count is 4 times the plotted one). When new offers are received, the count of viewed offers spikes and then slowly decreases over time.
Generally speaking, most offers are viewed within 24 hours of receipt.
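That claim could be checked by measuring time-to-view per customer. A simplified sketch, assuming one offer per customer (the real data needs per-offer matching), with made-up timestamps:

```python
# Sketch: median hours between receiving and viewing an offer.
import pandas as pd

toy = pd.DataFrame({
    'person': ['a', 'a', 'b', 'b', 'c', 'c'],
    'event':  ['offer received', 'offer viewed'] * 3,
    'time':   [0, 6, 0, 30, 168, 174],   # hours, made up
})
received = toy.query('event == "offer received"').set_index('person')['time']
viewed = toy.query('event == "offer viewed"').set_index('person')['time']
# Aligned by person index: elapsed hours from receipt to view.
hours_to_view = viewed - received
median_hours = hours_to_view.median()
```

A median well under 24 on the real data would confirm the visual impression from the chart.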
Let's analyze offers completed events
offer_completed_events = pd.DataFrame().append([
merged_df[['person', 'event', 'time', 'offer_id', 'amount', 'duration','rewards_t', 'rewards_p']]
.query('event=="offer completed"')])
offer_completed_events.head(2)
offer_completed = offer_completed_events.time.value_counts().sort_index()
p2 = None
output_file("offers_completed_timeline.html")
p2 = figure(title="# of offer completed over time (in hours)",
x_axis_label='timestamp in hours',
y_axis_label = '# of event',
plot_width=900, tools= "pan,wheel_zoom,box_zoom,reset,hover"
)
yticks = np.array([0, 100, 200, 300, 400, 600, 800])
p2.yaxis.ticker = yticks
p2.line(offer_completed.index, offer_completed, line_width =2 , color= 'navy')
p2.square(offer_completed.index, offer_completed, color = "olive", size=8 , legend_label='offer_completed')
show(p2)
Observation
The above chart shows the number of completed offers against the timestamp in hours. It follows a similar pattern to the previous graph because new offers are sent 6 times in the 24-day period, and hence the chart has 6 peaks.
Between peaks, the completed-offers chart decreases gradually, just like the viewed-offers chart. However, the decrease is not as smooth as for viewed offers; there are some local peaks during the decrease.
Let's analyse transaction event
transaction_events = merged_df.loc[merged_df['event'].isin(['transaction'])]
transaction_events.amount.max() #highest value single transaction recorded in dataset
transactions_count = transaction_events.time.value_counts()
transactions_count_index_sorted = transaction_events.time.value_counts().sort_index()
transactions_count.min() #minimum number of transaction at any given point of time in dataset
output_file("transaction_timeline.html")
p3 = figure(title="# of transactions over time (in hours)",
x_axis_label='timestamp in hours',
y_axis_label = '# of events',
plot_width=900, tools= "pan,wheel_zoom,box_zoom,reset,hover"
)
yticks = np.array([0, 50, 100, 200, 300, 400, 500, 600, 800, 900, 1100, 1300, 1700, 2200, 3000])
p3.yaxis.ticker = yticks
p3.line(transactions_count_index_sorted.index, transactions_count_index_sorted, line_width =2 , color= 'navy')
p3.diamond_cross(transactions_count_index_sorted.index, transactions_count_index_sorted, color = "red", size=8 , legend_label='Transactions')
show(p3)
Observation
The above chart shows that there are some transactions at timestamp 0 (635 transactions).
Some of these may have contributed to the completion of offers. To check, we should overlay the completed-offers chart for a visual picture.
The transaction chart seems to follow the peaks as well, but it is not as smooth as the other charts, such as offers viewed and offers completed.
p4 = None
output_file('offer_completed_tnx.html')
p4 = figure(title="# of transactions over time (in hours)",
x_axis_label='timestamp in hours',
y_axis_label = '# of events',
plot_width=950, tools= "pan,wheel_zoom,box_zoom,reset,hover"
)
yticks = np.array([0, 50, 100, 200, 300, 400, 500, 600, 800, 900, 1100, 1300, 1700, 2200, 3000])
p4.yaxis.ticker = yticks
p4.line(transactions_count_index_sorted.index, transactions_count_index_sorted, line_width =2 , color= 'navy')
p4.diamond_cross(transactions_count_index_sorted.index, transactions_count_index_sorted, color = "red", size=8 , legend_label='Transactions')
#p3.line(s.index, s, line_width =2)
#p3.circle(s.index, s, fill_color = "white",color = "navy", size=8 , legend_label='offer_viewed')
p4.cross(offer_sent_shrinked.index, offer_sent_shrinked/2, size=12,
color="#E6550D", line_width=2, legend_label='* 8 = offers_sent')
p4.line(offer_completed.index, offer_completed, line_width =2 , color= 'navy')
p4.square(offer_completed.index, offer_completed, color = "olive", size=8 , legend_label='offer_completed')
show(p4)
Observation
There is a local peak in transactions within each offer-sent interval. This implies that when offers are received, customers quickly make transactions, resulting in offer completions.
At timestamp 0, there are around 600 transactions. Around 200 of them lead to offer completion (the number of offers completed at timestamp 0), a conversion rate of roughly 30%. This is in the same ballpark as the ratio of total completed offers to total transactions: (33579/138953)*100 = 24.16%.
transactions_count_index_sorted.sum()
offer_completed.sum()
conversion_rate = offer_completed.sum()/transactions_count_index_sorted.sum()
conversion_rate*100
Next Steps
There exist some high-value transactions (above 100 USD). These could be large orders from individuals organizing events, or from corporate customers. Such high-value transactions cause all of the current offers associated with those customers to complete. But in reality, these transactions were not motivated by the offers, and the completions should therefore be considered side effects. Sending offers to such high-paying customers would not increase or decrease their purchasing patterns. Therefore we should remove such high-value transactions from the dataset.
Next, we should also find and remove non-responsive customers. I define a non-responsive customer as one with no received offer viewed and not a single transaction made.
merged_df_by_person = merged_df.groupby('person')
customers_no_transaction =merged_df_by_person.count().query('amount == 0').index
A = set(customers_no_transaction)
Future Task: get the profiles of the above 422 customers and analyze their demographics if of interest
offers_viewed = merged_df.query \
('person in @customers_no_transaction and event == "offer viewed"').groupby('person').count()
B = set(offers_viewed.index)
C = A-B
C
print (C)
print(str(len(C)) + ' truly non-responsive customers found')
Observation
Set C contains all customers who have not viewed any offer and have not made any transaction.
Therefore set C represents the truly non-responsive customers (they should not be considered for further offers)
offers_received = merged_df.query \
('person in @customers_no_transaction and event == "offer received"').groupby('person').count()
D = set(offers_received.index)
len(D)
E = A-D
len(E)
Observation
Let's examine the transaction amounts
transactions_amount = transaction_events.amount.value_counts().sort_index()
p5 = None
output_file('transaction_amount.html')
p5 = figure(title="# of transactions against amount spent",
x_axis_label='Amount($)',
y_axis_label = '# of transactions',
plot_width=900, tools= "pan,wheel_zoom,box_zoom,reset,hover"
)
yticks = np.array([0, 50, 100, 200, 300, 400, 500])
p5.yaxis.ticker = yticks
p5.scatter(transactions_amount.index, transactions_amount, marker= "circle", color= 'orange', radius = 4)
show(p5)
Observation
There are some transactions above 1000 USD.
There are many more transactions in the range [0.05, 50] than in the rest of the range.
Any transaction greater than 50 USD can be treated as a high-value transaction, not necessarily motivated by completing an offer.
We should remove any transaction > 50 USD in our data modeling.
Let's print some interesting information regarding transaction amounts
print('Number of repeat customers with more than 1 transaction of value >= $50.00:',
transaction_events[transaction_events.amount >= 50].person.duplicated().sum())
print('Number of customers with transactions of value >= $50.00:',
transaction_events[transaction_events.amount >= 50].person.nunique())
print('Number of customers with transactions of value <= $50.00:',
transaction_events[transaction_events.amount <= 50].person.nunique())
print('Number of customers with transactions of value <= $10.00:',
transaction_events[transaction_events.amount <= 10].person.nunique())
print('Number of customers with transactions of value <= $5.00:',
transaction_events[transaction_events.amount <= 5].person.nunique())
print('Number of customers with transactions of value <= $1.00:',
transaction_events[transaction_events.amount <= 1].person.nunique())
print('Number of customers with transactions of value <= $0.25:',
transaction_events[transaction_events.amount <= 0.25].person.nunique())
len(A)
#list(A)
#Let's remove non-responsive customers
final_df = merged_df[~merged_df.person.isin(A)]
final_df.head(2)
#Let's remove transactions of 50 USD or more. We need to fill NA values before we can filter those transactions
final_df = final_df.fillna(0)
final_df = final_df[final_df.amount <50]
final_df = final_df.reset_index(drop= True)
# Adding cumulative amount spent
final_df['cum_amount'] = final_df.groupby('person').amount.cumsum()
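The cumulative-spend feature relies on `groupby(...).cumsum()` accumulating within each person in row order (assumed to be time order, since the transcript is sorted by time). A tiny self-contained illustration with made-up data:

```python
import pandas as pd

spend_df = pd.DataFrame({'person': ['a', 'b', 'a', 'b'],
                         'amount': [5.0, 2.0, 10.0, 0.0]})
# Running total of amount within each person, in row order
spend_df['cum_amount'] = spend_df.groupby('person').amount.cumsum()
print(spend_df.cum_amount.tolist())  # [5.0, 2.0, 15.0, 2.0]
```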
Now, final_df is ready for feature engineering!
In this section, we will first explore customer interaction with offers through some advanced visualisations.
Based on that, we will define some custom calculated features that could be useful for our classification problem.
Furthermore, the analysis should also help us define our label, i.e. perform label engineering.
def person_data(df, person):
'''
Displays a unique customer's event history
Parameters
-----------
df: dataframe containing offer and transaction events
person: if int, the customer index as per the order in which the
customer appears in the transcript data; if string, the person
referenced by their unique 'person' id
'''
if type(person) == str:
return df[df.person == person]
else:
return df[df.person == df.person.unique()[person]]
person_data(final_df, 89)
p11 = None
p11 = figure(title="Cumulative spend over time (in days)",
x_axis_label='timestamp in days',
y_axis_label = 'Cumulative spend in $',
plot_width=950, tools= "pan,wheel_zoom,box_zoom,reset,hover"
)
yticks = np.array([0, 25, 50, 75, 100, 200, 300])
p11.yaxis.ticker = yticks
def plot_customer_journey(df, person, canvas, tnx_color='black', line_color='pink', is_show = True):
'''
This function
- plots a customer's interaction with offers via a chart showing transaction value against timestamp
Input:
- person : person id as string, or index of person in df as integer
- canvas : bokeh canvas object to draw the journey on
- tnx_color : color of transaction events
- line_color : color of step lines
- is_show : Boolean flag indicating whether the plot/journey should be shown or the canvas returned
Return:
- canvas object containing the customer journey. This canvas can be used to combine another customer's
journey by calling this function on that customer with the same canvas.
'''
x = []
y = []
lt =['transaction', 'offer received', 'offer viewed','offer completed']
markers = ['circle', 'inverted_triangle', 'triangle', 'x']
colors = [tnx_color, 'chocolate', 'darkcyan', 'salmon']
p10 = canvas
p10.legend.location = 'bottom_right'
if type(person) == str:
person_label = person[0:3]
else:
person_label = str(person)
for i, event in enumerate(lt):
if event == 'transaction':
x.append(person_data(df, person).time/24)
y.append(person_data(df,person).cum_amount)
p10.step(x[i], y[i],line_width =0.8 , color= line_color , legend_label = 'spend amount: ' + person_label )
p10.scatter(x[i], y[i], marker= markers[i], color = colors[i], legend_label = event +': ' + person_label , size=5)
else:
try:
x.append(person_data(df, person)[person_data(df, person).event == event].time/24)
y.append(person_data(df, person)[person_data(df, person).event == event].cum_amount)
p10.scatter(x[i], y[i], marker= markers[i], color = colors[i], legend_label = event, size=10)
except:
pass
if event == 'offer received':
received = person_data(df, person)[person_data(df, person).event=='offer received']\
[['time', 'difficulty', 'cum_amount', 'duration']]\
.reset_index()
for i in received.index:
x_diff = [received.iloc[i].time/24,
received.iloc[i].time/24 + received.iloc[i].duration]
y_diff = [received.iloc[i].cum_amount,
received.iloc[i].cum_amount + received.iloc[i].difficulty]
p10.line(x_diff, y_diff, color='navy',line_width =0.5, legend_label = 'offer duration')
if is_show == False:
return p10
else:
show(p10)
p11 = plot_customer_journey(final_df, 57, p11, 'olivedrab', 'pink', False)
output_file("dual_customers_journey.html")
plot_customer_journey(final_df, 52, p11, 'darkkhaki', 'black')
Observation
Based on the above analysis, we will derive some useful custom features
def percent_of_offers_completed(person_data_df):
'''
This function
- calculates the percentage of offers completed by a particular customer
Input:
- person_data_df is dataframe containing offer and transaction information for a single person
Output:
- percentage count value as floating point number
'''
percent_count = 0
sent_offer_count = 0
completed_offer_count = 0
for x in person_data_df.event:
if x == 'offer received':
sent_offer_count += 1
if x == 'offer completed':
completed_offer_count += 1
if sent_offer_count == 0:
percent_count = 0
else:
percent_count = (completed_offer_count/sent_offer_count) * 100
return percent_count
def absolute_offers_completed(person_data_df):
'''
This function
- calculates the absolute number of offers completed by a particular customer without considering if the
offer is viewed before completion or not
Input:
- person_data_df is dataframe containing offer and transaction information for a single person
Output:
- offer completion count value as integer
'''
completed_offer_count = 0
for x in person_data_df.event:
if x == 'offer completed':
completed_offer_count += 1
return completed_offer_count
def percent_of_offers_viewed(person_data_df):
'''
This function
- calculates the percentage of offers viewed by a particular customer
Input:
- person_data_df is dataframe containing offer and transaction information for a single person
Output:
- percentage count value as floating point number
'''
percent_count = 0
sent_offer_count = 0
viewed_offer_count = 0
for x in person_data_df.event:
if x == 'offer received':
sent_offer_count += 1
if x == 'offer viewed':
viewed_offer_count += 1
if sent_offer_count == 0:
percent_count = 0
else:
percent_count = (viewed_offer_count/sent_offer_count) * 100
return percent_count
def absolute_offers_viewed(person_data_df):
'''
This function
- calculates the number of offers viewed by a particular customer
Input:
- person_data_df is dataframe containing offer and transaction information for a single person
Output:
- offer viewed count value as integer
'''
viewed_offer_count = 0
for x in person_data_df.event:
if x == 'offer viewed':
viewed_offer_count += 1
return viewed_offer_count
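As a side note, the four loop-based helpers above could be computed in a single pass with `value_counts`. A hedged sketch (the `offer_stats` name is mine, not part of the original code):

```python
import pandas as pd

def offer_stats(person_data_df):
    """One-pass equivalent of the four helpers above: counts each event
    type once and derives both absolute and percentage figures."""
    counts = person_data_df.event.value_counts()
    received = counts.get('offer received', 0)
    completed = counts.get('offer completed', 0)
    viewed = counts.get('offer viewed', 0)
    pct = lambda n: (n / received) * 100 if received else 0
    return {'abs_completed': completed, 'pct_completed': pct(completed),
            'abs_viewed': viewed, 'pct_viewed': pct(viewed)}

demo = pd.DataFrame({'event': ['offer received', 'offer viewed',
                               'offer received', 'offer completed']})
print(offer_stats(demo))
```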
#https://stackoverflow.com/questions/72899/how-do-i-sort-a-list-of-dictionaries-by-a-value-of-the-dictionary
def sort_key_func(item):
""" helper function used to sort list of dicts
:param item: dict
:return: sorted list of tuples (k, v)
"""
pairs = []
for k, v in item.items():
pairs.append(v)
return sorted(pairs)
def find_most_responsive_customer(df, limit):
'''
This function should provide a list of dictionaries specifying customer id
and number of completed offers (weighted by number of offers sent) for that customer
in a given dataframe sorted by values of completed offer
'''
person_list = df.person.unique().tolist()
if limit != None:
if type(limit) == int:
person_list = person_list[:limit]
percent_offers_completed = []
abs_offers_completed = []
percent_offers_viewed = []
abs_offers_viewed = []
for person in person_list:
element_p_o_c = {}
element_a_o_c = {}
element_p_o_v = {}
element_a_o_v = {}
person_data_df = person_data(df, person)
value_p_o_c = percent_of_offers_completed(person_data_df)
value_a_o_c = absolute_offers_completed(person_data_df)
value_p_o_v = percent_of_offers_viewed(person_data_df)
value_a_o_v = absolute_offers_viewed(person_data_df)
if type(person) == str:
element_p_o_c[person] = value_p_o_c
percent_offers_completed.append(element_p_o_c)
element_a_o_c[person] = value_a_o_c
abs_offers_completed.append(element_a_o_c)
element_p_o_v[person] = value_p_o_v
percent_offers_viewed.append(element_p_o_v)
element_a_o_v[person] = value_a_o_v
abs_offers_viewed.append(element_a_o_v)
else:
continue
percent_offers_completed = sorted(percent_offers_completed, key=sort_key_func, reverse=True)
abs_offers_completed = sorted(abs_offers_completed, key=sort_key_func, reverse=True)
percent_offers_viewed = sorted(percent_offers_viewed, key=sort_key_func, reverse=True)
abs_offers_viewed = sorted(abs_offers_viewed, key=sort_key_func, reverse=True)
return percent_offers_completed, abs_offers_completed, percent_offers_viewed, abs_offers_viewed
#Let's call the above function to create list of custom features related to offers
percent_offers_completed, abs_offers_completed, percent_offers_viewed, abs_offers_viewed = find_most_responsive_customer(final_df, None)
percent_offers_completed[:5]
abs_offers_completed[:5]
percent_offers_viewed[:5]
abs_offers_viewed[:5]
Observation
The above analysis suggests that a maximum of 6 offers were sent to at least one customer in the 30-day period.
unique_persons = final_df.person.unique().tolist()
# there are 16578 unique customers in the dataset
len(unique_persons)
p7 = None
output_file('single_customer_journey.html')
p7 = figure(title="Cumulative spend over time (in days)",
x_axis_label='timestamp in days',
y_axis_label = 'Cumulative spend in $',
plot_width=950, tools= "pan,wheel_zoom,box_zoom,reset,hover"
)
plot_customer_journey(final_df, '6e014185620b49bd98749f728747572f', p7)
Observation
Here in the above chart, we have a customer who has completed all the offers sent out to him/her. It usually takes more than one transaction to complete an offer if those transactions are low value.
Sometimes the customer views the offer instantly, while at other times he/she views it later.
Out of the 4 offers completed, only 1 (the last completed) was not viewed by the customer.
p12 = None
output_file("dual_high_value_customers_journey.html")
p12 = figure(title= " Cumulative spend over time (in days) ",
x_axis_label='timestamp in days',
y_axis_label = 'Cumulative spend in $',
plot_width=950, tools= "pan,wheel_zoom,box_zoom,reset,hover"
)
#yticks = np.array([0, 25, 50, 75, 100, 200, 300])
#p12.yaxis.ticker = yticks
p12 = plot_customer_journey(final_df, '9fa9ae8f57894cc9a3b8a9bbe0fc1b2f', p12, 'olivedrab', 'pink', False)
plot_customer_journey(final_df, 'fe8264108d5b4f198453bbb1fa7ca6c9', p12, 'indigo', 'skyblue')
Observation
Here we have two high-paying customers side by side, both of whom completed all the received offers.
Interestingly, after completing the first offer, customer '9fa' made 5 transactions totaling around 75 USD without any pending offer to complete. This demonstrates that those transactions were not motivated by offer completion and shows the customer's spending tendencies in the absence of any offers.
def find_total_spend_amount(df, limit = None):
'''
This function
- calculates the total amount spent by each customer during the 30-day time frame
Input:
df: input dataframe containing all the customers and corresponding offer and transaction information
limit: integer value to limit the operation to only a limited number of customers.
It will slice the person list first (selecting limited entries) and then perform the calculation
Output:
Sorted array based on amount value where each element is a key-value pair, i.e. {person_id: spend_amount}
'''
person_list = df.person.unique().tolist()
if limit != None:
if type(limit) == int:
person_list = person_list[:limit]
amounts = []
for person in person_list:
element = {}
person_data_df = person_data(df, person)
max_amount = person_data_df.cum_amount.max()
if type(person) == str:
element[person] = max_amount
amounts.append(element)
amounts = sorted(amounts, key=sort_key_func, reverse=True)
return amounts
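Since `cum_amount` is a monotone running total per person (amounts are non-negative), its per-person maximum equals the per-person sum of `amount`. A minimal sketch of that equivalent `groupby` route, on toy data rather than the real transcript:

```python
import pandas as pd

tx_df = pd.DataFrame({'person': ['a', 'a', 'b'],
                      'amount': [5.0, 10.0, 2.0]})
tx_df['cum_amount'] = tx_df.groupby('person').amount.cumsum()

# Equivalent to taking cum_amount.max() per person inside the loop above
totals = tx_df.groupby('person').amount.sum().sort_values(ascending=False)
print(totals.to_dict())  # {'a': 15.0, 'b': 2.0}
```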
amounts = find_total_spend_amount(final_df, None)
len(amounts)
def create_dataframe_from_list(input_list, name):
'''
This function
- creates a dataframe out of a list having dictionary elements. Key of each element is person_id and
value is corresponding target value (i.e. total spend amount, percent of offers completed)
'''
df1 = pd.DataFrame.from_dict(input_list, orient='columns').astype(float).sort_index()
df1 = df1.T
df1[name] = df1.sum(axis=1)
df1 = df1[name]
df2 = pd.DataFrame([df1])
df = df2.T
df['person'] = df.index
df = df.reset_index(drop=True)
return df
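The transpose-and-sum route above works, but a list of single-key `{person_id: value}` dicts can also be flattened directly. A sketch under that assumption (the column name `spend` is chosen for illustration):

```python
import pandas as pd

amounts = [{'p1': 120.5}, {'p2': 80.0}, {'p3': 15.25}]

# Flatten the single-key dicts into (person, value) rows
rows = [(person, value) for d in amounts for person, value in d.items()]
flat_df = pd.DataFrame(rows, columns=['person', 'spend'])
print(flat_df.shape)  # (3, 2)
```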
df = create_dataframe_from_list(amounts, 'spend')
amounts_df = df
amounts_df.head(2)
percent_offers_viewed_df = create_dataframe_from_list(percent_offers_viewed, 'percent_offers_viewed')
percent_offers_viewed_df.head(2)
abs_offers_viewed_df = create_dataframe_from_list(abs_offers_viewed, 'abs_offers_viewed')
abs_offers_viewed_df.head(2)
percent_offers_completed_df = create_dataframe_from_list(percent_offers_completed, 'percent_offers_completed')
percent_offers_completed_df.head(2)
abs_offers_completed_df = create_dataframe_from_list(abs_offers_completed, 'abs_offers_completed')
abs_offers_completed_df.head(2)
final_df.columns
def save_file(df, name, path='../data/processed', latest=True):
'''
Helper function saves dataFrame to .joblib format
default path '../data/processed'
Parameters
-----------
df: given dataFrame
name: filename as string value
'''
joblib.dump(df, path + '/' + name, compress=True)
if latest:
joblib.dump(df, path + '/' + 'latest.joblib', compress=True)
print('saved as {}'.format(path + '/' + name))
def create_target_df(df):
'''
This function
- merges customer features into the given df
- performs some cleanup of redundant columns
- selects only the 'offer received' events in the final dataframe
- saves the dataframe as a joblib file which can be used for model training later
'''
df = df.loc[df['event'].isin(['offer received'])]
df = df.merge(abs_offers_completed_df, left_on='person', right_on='person',
how='left')
df = df.merge(amounts_df, left_on='person', right_on='person',
how='left')
df = df.merge(percent_offers_viewed_df, left_on='person', right_on='person',
how='left')
df = df.merge(percent_offers_completed_df, left_on='person', right_on='person',
how='left')
df = df.merge(abs_offers_viewed_df, left_on='person', right_on='person',
how='left')
df= df.drop(['amount','rewards_t','is_email'], axis=1)
df =df.drop(['became_member_on'], axis=1)
df.drop(['event'], axis=1, inplace=True)
save_file(df, name= str(pd.to_datetime('today'))+'_target_df.joblib')
return df
df = create_target_df(final_df)
df
new_df = joblib.load('../data/processed/latest.joblib')
new_df
Our problem falls into the category of multi-class classification, which can be summarised as follows:
Given a dataset with instances 𝑥𝑖 together with 𝑁 classes, where every instance 𝑥𝑖 belongs to precisely one class 𝑦𝑖, the task is to predict the class of each instance.
After training and testing, we have a table with the correct class 𝑦𝑖 and the predicted class 𝑎𝑖 for every instance 𝑥𝑖 in the test set. So for every instance, we have either a match (𝑦𝑖=𝑎𝑖) or a miss (𝑦𝑖≠𝑎𝑖).
Assuming we have balanced class distribution in our training set, evaluation using a confusion matrix together with the average accuracy score should be sufficient. However, F1-score can also be used for the evaluation of the multi-class problem.
Since the cost of misclassification is not high in our case (sending an offer to a non-responsive customer doesn't cost the company extra money), the F1-score is not strictly necessary.
In this project, I prefer to use the confusion matrix and the average accuracy score as our evaluation measures.
Confusion Matrix:
A confusion matrix shows the combination of the actual and predicted classes. Each row of the matrix represents the instances in an actual class, while each column represents the instances in a predicted class (the convention used by scikit-learn's confusion_matrix). It is a good measure of whether models can account for the overlap in class properties and understand which classes are most easily confused.
Accuracy:
Percentage of total items classified correctly: (TP+TN)/(P+N)
For unbalanced class distribution, I have provided weights to each class label, and CatBoost automatically handles it.
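A toy illustration of these metrics on made-up predictions (scikit-learn's `confusion_matrix` puts actual classes in rows and predicted classes in columns):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up 3-class example: 4 of 6 instances classified correctly
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

print(confusion_matrix(y_true, y_pred))  # rows = actual, columns = predicted
print(accuracy_score(y_true, y_pred))    # 4/6 ≈ 0.667
print(f1_score(y_true, y_pred, average='weighted'))
```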
Evaluation Strategy:
1. Initial Model Evaluation with fixed values of hyperparameters
I evaluate CatBoost Classifier with following fixed hyperparameters on all classification problems (class_2, class_3, class_4, class_5)
eval_metric = 'Accuracy'
and the rest of the parameter values as default provided by CatBoost.
2. Model Evaluation with best hyperparameter values found using GridSearch
In this round of experiments, I wrote a custom GridSearch function, which finds the best values within the given hyperparameter ranges and returns the model hyperparameters with the best average accuracy on the training data.
For the CatBoost multiclass classifier, there are numerous hyperparameters to tune. An extensive list can be found here: https://catboost.ai/docs/concepts/parameter-tuning.html
Here I selected only the following hyperparameters, specified the recommended range (found via some research on the CatBoost website and Kaggle) for each of these parameters, and found the model with the best score.
- iterations = [1000, 3000] number of boosting iterations
- loss_function = ['Logloss', 'MultiClass', 'MultiClassOneVsAll']
- depth = [4, 6, 8] maximum depth of the trees
- early_stopping_rounds = [10, 20, 50] parameter for fit() - stop the training if one metric of the validation data does not improve in the last early_stopping_rounds rounds.
def f_2(row):
if row['percent_offers_completed'] >=50 :
val = 0
else:
val = 1
return val
def f_3(row):
if row['percent_offers_completed'] >=66:
val = 0
elif row['percent_offers_completed'] <33:
val = 1
else:
val = 2
return val
def f_4(row):
if row['percent_offers_completed'] >=75 :
val = 0
elif row['percent_offers_completed'] <25:
val = 1
elif row['percent_offers_completed'] >=25 and row['percent_offers_completed'] < 50:
val = 2
else:
val = 3
return val
def f_5(row):
if row['percent_offers_completed'] >=80 :
val = 0
elif row['percent_offers_completed'] <20:
val = 1
elif row['percent_offers_completed'] >=20 and row['percent_offers_completed'] < 40:
val = 2
elif row['percent_offers_completed'] >=40 and row['percent_offers_completed'] < 60:
val = 3
else:
val = 4
return val
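The threshold ladders in f_2 through f_5 amount to equal-width binning of `percent_offers_completed`; the same binning can be sketched with `pd.cut` (note the label codes here run low-to-high, unlike the permuted codes returned by f_2–f_5):

```python
import pandas as pd

pct = pd.Series([10.0, 35.0, 55.0, 70.0, 95.0])

# Five equal-width, left-closed bins over [0, 100), as in f_5's thresholds
binned = pd.cut(pct, bins=[0, 20, 40, 60, 80, 100],
                labels=[0, 1, 2, 3, 4], right=False)
print(binned.tolist())  # [0, 1, 2, 3, 4]
```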
def load_dataset(name=None, default_path='../data/processed/', tag='latest'):
'''
This function
- loads a joblib file into a dataframe from the default_path location
- It loads the latest created file from the location unless a name is specified
'''
if name == None:
df = joblib.load(default_path + 'latest.joblib')
else:
df = joblib.load(default_path + name)
return df
def create_class_label(df, n_class):
'''
This function
- creates a label column in the given dataframe based on the value of n_class
- e.g. if n_class = 2, it creates two labels based on the following conditions:
if percent_offers_completed >= 50 then label = 0
if percent_offers_completed < 50 then label = 1
- returns a dataframe
'''
f_list = [f_2, f_3, f_4, f_5]
df['label'] = df.apply(f_list[n_class-2], axis=1)
return df
def feature_clean_up(df):
'''
This function
- removes redundant features
- removes features from which labels were derived
'''
df = df.drop('percent_offers_completed', axis=1)
df = df.drop('abs_offers_completed', axis=1)
df = df.drop('percent_offers_viewed', axis=1)
df = df.drop('abs_offers_viewed', axis=1)
return df
def train_test_base_model(n_class, custom_categories = False, categories_list = None, verbose=True, labels = None):
'''
This function
- loads the prepared dataset into a dataframe
- creates class labels as per the 'n_class' value [max capped at 5]
- removes redundant features
- saves the processed dataframe as a joblib file
- creates train and test splits
- creates catboost pools
- configures a catboost model with default parameters for multiclass classification
- configures default categorical attributes unless custom categories are provided
- trains/fits a model with a fixed number of iterations and fixed values of hyperparameters
- returns learning rate, # of iterations, accuracy score, f1-score, and n_class values in a list of dictionary elements
'''
df = load_dataset()
if n_class > 5:
n_class = 5
if n_class < 2:
n_class = 2
df = create_class_label(df, n_class)
df = feature_clean_up(df)
save_file(df, 'class_' + str(n_class) + '.joblib', path='../data/processed', latest=False)
y = df.label
X = df.drop('label', axis=1)
if custom_categories != True:
categorical_features_indices = np.where(X.dtypes != float)[0]
else:
# Assigns column index locations of categorical features
if categories_list != None:
categorical_features_indices = [X.columns.get_loc(i) for i in categories_list if i in X.columns]
else:
categorical_features_indices = np.where(X.dtypes != float)[0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, shuffle=True,
random_state=21)
# Assigns weights to labels since these are unbalanced
weights = [df.label.value_counts().sum() / df.label.value_counts()[i] for i in
set(labels.values())]
train_pool = Pool(data=X_train, label=y_train, cat_features=categorical_features_indices)
test_pool = Pool(data=X_test, label=y_test, cat_features=categorical_features_indices)
#reset model object every time this function is called
model = None
model = CatBoostClassifier(
iterations= 2000,
loss_function='MultiClass',
early_stopping_rounds=50,
cat_features=categorical_features_indices,
class_weights= weights,
eval_metric = 'Accuracy',
use_best_model=True,
logging_level='Silent')
model = model.fit(train_pool,
eval_set=test_pool,
plot=True);
preds_class = model.predict(X_test)
if verbose:
display(F'Learning Rate set to: {model.get_all_params()["learning_rate"]}')
display(F'Accuracy Score: {accuracy_score(y_test, preds_class)}')
display(F'Weights: {weights}')
display(F"F1-Score: {f1_score(y_test, preds_class,average='weighted')}")
matrix = confusion_matrix(y_test, preds_class)
print('-'*50)
print('Confusion Matrix: ')
print(matrix)
print('-'*50)
'''
print('Starting the training with 3-Fold validation using Catboost CV:')
print('-'*50)
cv_params = model.get_params()
cv_data = cv(train_pool,
cv_params,
iterations = 500,
shuffle = False,
early_stopping_rounds=50,
plot=True)
print(cv_data)
'''
e_learning_rate = {}
e_learning_rate['learning_rate'] = model.get_all_params()["learning_rate"]
e_acc_score = {}
e_acc_score['accuracy_score'] = accuracy_score(y_test, preds_class)
e_f1_score = {}
e_f1_score['f1_score'] = f1_score(y_test, preds_class,average='weighted')
e_n_iteration = {}
e_n_iteration['n_iterations'] = 2000
e_n_class = {}
e_n_class['n_class'] = n_class
r_list = []
r_list.append(e_learning_rate)
r_list.append(e_acc_score)
r_list.append(e_f1_score)
r_list.append(e_n_iteration)
r_list.append(e_n_class)
return r_list
- We will perform a standard train-test split using sklearn's 'train_test_split' function
- test_size is set to 0.4 (40%) and shuffling is set to True
- For reproducibility, random_state is set to '21'
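A minimal, self-contained illustration of that split on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# 60/40 split with shuffling and a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, shuffle=True, random_state=21)
print(X_train.shape, X_test.shape)  # (6, 2) (4, 2)
```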
encoding = {
'responsive': 0,
'unresponsive':1
}
results_2_class = train_test_base_model(n_class=2, labels=encoding)
print( results_2_class)
learning_acc_2 = 0.95157
encoding = {'responsive': 2,
'very_responsive': 0,
'unresponsive':1
}
results_3_class = train_test_base_model(n_class=3, labels=encoding)
print(results_3_class)
learning_acc_3 = 0.88500
encoding = {'responsive': 3,
'very_responsive': 0,
'moderately_responsive': 2,
'unresponsive':1
}
results_4_class = train_test_base_model(n_class=4, labels=encoding)
learning_acc_4 = 0.84480
encoding = {'responsive': 4,
'very_responsive': 0,
'moderately_responsive': 3,
'very_moderately_responsive':2,
'unresponsive':1
}
results_5_class = train_test_base_model(n_class=5, labels=encoding)
learning_acc_5 = 0.83440
#results_5_class
#results_4_class
#results_3_class
#results_2_class
base_train_accuracy = [learning_acc_2,learning_acc_3,learning_acc_4,learning_acc_5]
base_test_accuracy = [results_2_class[1]['accuracy_score'],results_3_class[1]['accuracy_score'],results_4_class[1]['accuracy_score'],results_5_class[1]['accuracy_score']]
p_base = None
output_file('base_evaluation.html')
p_base = figure(title="Base Model evaluation with 2000 iterations & fixed hyperparameters",
x_axis_label='# number of target classes',
y_axis_label = 'Average accuracy',
plot_width=500, tools= "pan,wheel_zoom,box_zoom,reset,hover"
)
x = [2,3,4,5]
xticks = np.array([2,3,4,5])
p_base.xaxis.ticker = xticks
p_base.circle(x, base_test_accuracy, color= 'orange', size = 6,legend_label = 'Test accuracy')
p_base.square(x, base_train_accuracy, color= 'navy', size = 6, legend_label = 'Training accuracy')
p_base.legend.location = 'bottom_left'
show(p_base)
Observation As shown in the above chart, training & test accuracy decrease as the number of labels increases.
RANDOM_STATE = 0
def get_xy(n_class, labels, custom_categories=False, categories_list=None):
'''
This function
- loads the prepared dataset into a dataframe
- creates class labels as per the 'n_class' value [max capped at 5]
- removes redundant features
- saves the processed dataframe as a joblib file
- finds the categorical feature indices from the dataset
- finds weights for each label
- Returns X, y, categorical feature indices, and weights
'''
df = load_dataset()
if n_class > 5:
n_class = 5
if n_class < 2:
n_class = 2
df = create_class_label(df, n_class)
df = feature_clean_up(df)
save_file(df, 'class_' + str(n_class) + '.joblib', path='../data/processed', latest=False)
y = df.label
X = df.drop('label', axis=1)
# Assigns weights to labels since these are unbalanced
weights = [df.label.value_counts().sum() / df.label.value_counts()[i] for i in
set(labels.values())]
if custom_categories != True:
cat_features = np.where(X.dtypes != float)[0]
else:
# Assigns column index locations of categorical features
if categories_list != None:
cat_features = [X.columns.get_loc(i) for i in categories_list if i in X.columns]
else:
cat_features = np.where(X.dtypes != float)[0]
return X, y, cat_features, weights
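The inverse-frequency weight formula used in `get_xy` (and in `train_test_base_model`) can be illustrated on a toy label series:

```python
import pandas as pd

# Toy unbalanced label column: four 0s, one 1
labels = pd.Series([0, 0, 0, 0, 1])

counts = labels.value_counts()
# Inverse-frequency weights, as above: total count / count of class i
weights = [float(counts.sum() / counts.loc[i]) for i in sorted(counts.index)]
print(weights)  # [1.25, 5.0]
```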
def cross_val(X, y, param, cat_features, weights, n_splits=3):
'''
This function
- performs cross validation on the given X, y using the given set of parameters and returns the
average accuracy across folds
Input:
X -> feature matrix
y -> target labels
param -> set of parameters to use
weights -> class weights to use
n_splits -> number of splits for K-Fold validation (i.e. the value of K)
Return:
average accuracy = sum of accuracies / number of splits
Note: code adapted from https://www.kaggle.com/miklgr500/catboost-with-gridsearch-cv
'''
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state = RANDOM_STATE)
acc = []
predict = None
for tr_ind, val_ind in skf.split(X, y):
X_train = X.iloc[tr_ind]
y_train = y.iloc[tr_ind]
X_valid = X.iloc[val_ind]
y_valid = y.iloc[val_ind]
clf = CatBoostClassifier(iterations=param['iterations'],
loss_function = param['loss_function'],
depth=param['depth'],
early_stopping_rounds = param['early_stopping_rounds'],
eval_metric = 'Accuracy',
class_weights= weights,
use_best_model=True,
logging_level='Silent'
)
clf.fit(X_train,
y_train,
cat_features=cat_features,
eval_set=(X_valid, y_valid)
)
y_pred = clf.predict(X_valid)
accuracy = accuracy_score(y_valid, y_pred)
acc.append(accuracy)
return sum(acc)/n_splits
def catboost_GridSearchCV(X, y, params, cat_features, weights, n_splits=5):
'''
This function
- performs an iterative search for the best parameters by calling the cross_val function
- It identifies the best parameters by tracking the accuracy
- It returns the best found hyperparameters
'''
ps = {'acc':0,
'param': []
}
predict=None
for prms in tqdm(list(ParameterGrid(params)), ascii=True, desc='Params Tuning:'):
acc = cross_val(X, y, prms, cat_features, weights, n_splits=n_splits)
if acc>ps['acc']:
ps['acc'] = acc
ps['param'] = prms
print('Acc: '+str(ps['acc']))
print('Params: '+str(ps['param']))
return ps['param']
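`ParameterGrid` expands the dictionary of value lists into the cartesian product of combinations that the loop above iterates over; a small illustration:

```python
from sklearn.model_selection import ParameterGrid

# Two lists of sizes 3 and 2 expand to 3 * 2 = 6 parameter combinations
params = {'depth': [4, 6, 8],
          'iterations': [1000, 3000]}
grid = list(ParameterGrid(params))
print(len(grid))  # 6
print(grid[0])
```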
def train_test_tuned_model(n_class, labels = None, verbose= True):
'''
This function
- Uses the get_xy function to prepare the initial dataset and categorical features
- Splits the dataset into train and test sets
- Calls the catboost_GridSearchCV function to receive the best parameters
- Fits a model using the best parameters
- Tests the model and returns a list of performance metric values
'''
X, y, cat_features, weights = get_xy(n_class, labels)
#initial train and test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, shuffle=True,
random_state=21)
# remove LogLoss for n_class >2 from loss_function list below
# as LogLoss only works for binary classification
params = {'depth':[4, 6, 8],
'loss_function': ['MultiClass', 'MultiClassOneVsAll', 'Logloss'],
'iterations' : [1000, 3000],
'early_stopping_rounds' : [10, 20, 50]
}
param = catboost_GridSearchCV(X_train, y_train, params, cat_features, weights)
clf = CatBoostClassifier(iterations= param['iterations'],
loss_function = param['loss_function'],
depth=param['depth'],
early_stopping_rounds = param['early_stopping_rounds'],
eval_metric = 'Accuracy',
class_weights= weights,
use_best_model=True
)
X_train, X_valid, y_train, y_valid = train_test_split(X_train,
y_train,
shuffle=True,
random_state=RANDOM_STATE,
train_size=0.9,
stratify=y_train
)
print('Best Parameters found:')
print('-'*20)
print(param)
clf.fit(X_train,
y_train,
cat_features=cat_features,
logging_level='Silent',
eval_set=(X_valid, y_valid),
plot=True
)
preds_class = clf.predict(X_test)
if verbose:
display(F'Learning Rate set to: {clf.get_all_params()["learning_rate"]}')
display(F'Accuracy Score: {accuracy_score(y_test, preds_class)}')
display(F'Weights: {weights}')
display(F"F1-Score: {f1_score(y_test, preds_class,average='weighted')}")
matrix = confusion_matrix(y_test, preds_class)
print('-'*50)
print('Confusion Matrix: ')
print(matrix)
print('-'*50)
e_learning_rate = {}
e_learning_rate['learning_rate'] = clf.get_all_params()["learning_rate"]
e_acc_score = {}
e_acc_score['accuracy_score'] = accuracy_score(y_test, preds_class)
e_f1_score = {}
e_f1_score['f1_score'] = f1_score(y_test, preds_class,average='weighted')
e_n_iteration = {}
e_n_iteration['n_iterations'] = param['iterations']
e_n_class = {}
e_n_class['n_class'] = n_class
r_list = []
r_list.append(e_learning_rate)
r_list.append(e_acc_score)
r_list.append(e_f1_score)
r_list.append(e_n_iteration)
r_list.append(e_n_class)
#sub = pd.DataFrame({'Actual':y_test['label'],'Predicted':np.array(preds_class).astype(int)})
#sub.to_csv('../reports/results/' + 'class_'+ str(n_class) + '_optimised_results'+'.csv',index=False)
#print('saved results to ' + '../reports/results/' + 'class_'+ str(n_class) + '_optimised_results'+'.csv')
return r_list
encoding = {
'responsive': 0,
'unresponsive':1
}
results_class_1 = train_test_tuned_model(n_class=2, labels=encoding)
best_params_class_2 = {'depth': 4, 'early_stopping_rounds': 50, 'iterations': 3000, 'loss_function': 'MultiClass'}
results_class_2 = results_class_1
test_accuracy_class_2 = 0.94000971
encoding = {'responsive': 2,
'very_responsive': 0,
'unresponsive':1
}
results_class_3 = train_test_tuned_model(n_class=3, labels=encoding)
results_class_3
test_accuracy_class_3 = 0.87919
best_params_class_3 = {'depth': 6, 'early_stopping_rounds': 50, 'iterations': 3000, 'loss_function': 'MultiClass'}
encoding = {'responsive': 3,
'very_responsive': 0,
'moderately_responsive': 2,
'unresponsive':1
}
results_class_4 = train_test_tuned_model(n_class=4, labels=encoding)
results_class_4
test_accuracy_class_4 = 0.83747
best_params_class_4 = {'depth': 8,
'early_stopping_rounds': 50,
'iterations': 1000,
'loss_function': 'MultiClassOneVsAll'}
encoding = {'responsive': 4,
'very_responsive': 0,
'moderately_responsive': 3,
'very_moderately_responsive':2,
'unresponsive':1
}
results_class_5 = train_test_tuned_model(n_class=5, labels=encoding)
results_class_5
best_params_class_5 = {'depth': 8, 'early_stopping_rounds': 50, 'iterations': 1000, 'loss_function': 'MultiClass'}
test_accuracy_class_5 = 0.808035
#results_class_2
#results_class_3
#results_class_4
#results_class_5
tuned_test_accuracy = [test_accuracy_class_2,test_accuracy_class_3,test_accuracy_class_4,test_accuracy_class_5]
tuned_train_accuracy = [results_class_2[1]['accuracy_score'],results_class_3[1]['accuracy_score'],results_class_4[1]['accuracy_score'],results_class_5[1]['accuracy_score']]
p_tuned = None
output_file('tuned_evaluation.html')
p_tuned = figure(title="Tuned Model evaluation with best hyperparameters found",
x_axis_label='# number of target classes',
y_axis_label = 'Average accuracy',
plot_width=500, tools= "pan,wheel_zoom,box_zoom,reset,hover"
)
x = [2,3,4,5]
xticks = np.array([2,3,4,5])
p_tuned.xaxis.ticker = xticks
p_tuned.circle(x, tuned_test_accuracy, color= 'blue', size = 6,legend_label = 'Test accuracy')
p_tuned.square(x, tuned_train_accuracy, color='red', size=6, legend_label='Training accuracy')
p_tuned.legend.location = 'bottom_left'
show(p_tuned)
p_combined = None
output_file('combined_evaluation.html')
p_combined = figure(title="Base Model + Tuned Model evaluation",
x_axis_label='# number of target classes',
y_axis_label = 'Average accuracy',
plot_width=500, tools= "pan,wheel_zoom,box_zoom,reset,hover"
)
x = [2,3,4,5]
xticks = np.array([2,3,4,5])
p_combined.xaxis.ticker = xticks
p_combined.circle(x, tuned_test_accuracy, color= 'orange', size = 6,legend_label = 'Tuned Test accuracy')
p_combined.circle(x, base_test_accuracy, color= 'green', size = 6,legend_label = 'Base Test accuracy')
p_combined.square(x, tuned_train_accuracy, color='navy', size=6, legend_label='Tuned Training accuracy')
p_combined.square(x, base_train_accuracy, color='red', size=6, legend_label='Base Training accuracy')
p_combined.legend.location = 'bottom_left'
show(p_combined)
This shows that the base model actually performed better than the model with parameters found via GridSearch. This could be because the search ranges were not optimised; in future work, evaluating the model with a randomized search (e.g. scikit-learn's RandomizedSearchCV) could be interesting.
I have evaluated four different classification setups, and the results are documented above.
Providing too many parameters to GridSearch led to a very slow search (more than 7 hours for one model), so I reduced the parameter range, after which the search required about 20 minutes per model on average.
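As a sketch of the randomized alternative, instead of exhausting the full Cartesian product, one could sample a fixed number of parameter combinations and train only on those. This is a minimal stdlib illustration; the grid values and sample count below are hypothetical stand-ins for the ranges used above:

```python
import itertools
import random

# Hypothetical search space, mirroring the kind of grid used above
param_grid = {
    'depth': [4, 6, 8],
    'iterations': [1000, 3000],
    'loss_function': ['MultiClass', 'MultiClassOneVsAll'],
}

def sample_params(grid, n_iter, seed=42):
    """Return n_iter random parameter combinations from the grid,
    instead of the full Cartesian product GridSearch would try."""
    rng = random.Random(seed)
    keys = list(grid)
    all_combos = list(itertools.product(*(grid[k] for k in keys)))
    picked = rng.sample(all_combos, min(n_iter, len(all_combos)))
    return [dict(zip(keys, combo)) for combo in picked]

# 4 candidate configurations out of the 12 possible combinations;
# each could then be fed to a train/evaluate routine like the one above
candidates = sample_params(param_grid, n_iter=4)
```

With a fixed evaluation budget this trades exhaustiveness for speed, which is exactly the bottleneck observed here.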
There were some initial challenges in cleaning up the transactions and working out how to use transaction information for data modeling. The critical insight was to identify the timeline and relate it to the different events, which allowed me to draw some informative customer-journey visualizations and subsequently create custom features.
Overall, I found it exciting to work on this project. I learned a lot about data analysis, and particularly about data visualization using Bokeh.
Regarding the use of CatBoost, it required some initial trial and error to get working. Even though I haven't changed many of the default parameters, the results were excellent.
Part of the reason is due to the fact that I have included features like 'the number of offers viewed' in the training set, which is an excellent indicator of the offer completion rate, and on which the target classes were encoded.
For future refinement, I would suggest trying out custom categorical features for training the model.
It would also be interesting to remove some of the custom features from the dataset and evaluate how the model performs without them.
Concerning feature engineering, features could be developed around how much the customer has spent before viewing offers.
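That last feature could be sketched as follows. The event layout here is a hypothetical stand-in for transcript rows (the real data has similar 'transaction' / 'offer viewed' events per customer, ordered by time):

```python
def spend_before_first_view(events):
    """Sum transaction amounts that occur before a customer's first
    'offer viewed' event. `events` is assumed to be a single customer's
    records, already sorted by time; transactions carry an 'amount'."""
    total = 0.0
    for e in events:
        if e['event'] == 'offer viewed':
            break  # everything after the first view is excluded
        if e['event'] == 'transaction':
            total += e['amount']
    return total

# Hypothetical event stream for one customer
events = [
    {'event': 'transaction', 'time': 0, 'amount': 5.0},
    {'event': 'offer received', 'time': 6},
    {'event': 'transaction', 'time': 12, 'amount': 3.5},
    {'event': 'offer viewed', 'time': 18},
    {'event': 'transaction', 'time': 24, 'amount': 9.0},  # after first view
]
spend_before_first_view(events)  # -> 8.5
```

Applied per customer, this would give an "uninfluenced baseline spend" that complements the offer-view counts already in the training set.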